Simple Linear Regression is a statistical technique used to model the relationship between two continuous variables: a dependent variable (also known as the response or outcome variable) and an independent variable (also known as the predictor or explanatory variable).
It aims to find the best-fitting straight line through the data points to predict the values of the dependent variable based on the values of the independent variable.
The equation for a simple linear regression model is typically represented as:
y = β0 + β1 * x + ε
Where:
- y is the dependent variable (the variable we want to predict).
- x is the independent variable (the variable used to make predictions).
- β0 is the intercept, representing the value of y when x is zero.
- β1 is the slope of the regression line, indicating how much y changes for each unit change in x.
- ε represents the error term, which accounts for the variability of y that is not explained by the regression line.
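To make these pieces concrete, here is a minimal sketch (with made-up coefficients) that simulates data from this model; the intercept, slope, and error scale below are illustrative values, not estimates from any real dataset:

import numpy as np

rng = np.random.default_rng(42)
beta0 = 2.0                            # intercept: value of y when x is zero
beta1 = 0.5                            # slope: change in y per unit change in x
x = rng.uniform(0, 10, size=50)        # independent variable
epsilon = rng.normal(0, 1, size=50)    # error term: variability not explained by the line
y = beta0 + beta1 * x + epsilon        # dependent variable generated by the model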
The goal of simple linear regression is to estimate the values of β0 and β1 that minimize the sum of squared differences between the predicted values (β0 + β1 * x) and the actual observed values of the dependent variable. This is usually done using a method called the least squares approach.
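As a small illustration of that criterion, the helper below (a sketch; the function name is mine) computes the sum of squared differences for any candidate pair of coefficients, given NumPy arrays x and y; the least squares estimates are the pair that makes this quantity as small as possible:

import numpy as np

def sum_of_squared_errors(b0, b1, x, y):
    y_pred = b0 + b1 * x               # predicted values from the candidate line
    return np.sum((y - y_pred) ** 2)   # squared vertical distances, summed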
Once the regression coefficients (β0 and β1) are estimated, the fitted regression line can be used to make predictions for new values of the independent variable.
Simple linear regression is a fundamental and widely used technique in statistics and machine learning for understanding and modeling the relationship between two variables, especially when the relationship appears to be linear. However, when dealing with more complex relationships, multiple linear regression or other advanced regression techniques may be more appropriate.
The ordinary least squares (OLS) method provides a way to estimate these coefficients. Suppose we have a dependent variable (Y) and an independent variable (X), and we want to find the best-fitting line that represents the linear relationship between them. The equation of the line is given by:
Y = β0 + β1 * X
Where:
- Y is the dependent variable (the one we want to predict or explain).
- X is the independent variable (the predictor or explanatory variable).
- β0 is the intercept (the value of Y when X is 0).
- β1 is the slope (the change in Y for a one-unit change in X).
The goal of the OLS method is to find the values of β0 and β1 that minimize the sum of squared differences between the observed values of Y (Yi) and the predicted values (Ŷi) from the linear equation for all data points (i) in the dataset.
Mathematically, the OLS estimates of β0 and β1 are obtained as follows:
β1 = Σ((Xi - X̄)(Yi - Ȳ)) / Σ((Xi - X̄)²)
β0 = Ȳ - β1 * X̄
Where:
- Σ denotes the sum over all data points.
- Xi is the value of the independent variable for the ith data point.
- Yi is the value of the dependent variable for the ith data point.
- X̄ is the mean of all X values.
- Ȳ is the mean of all Y values.
The OLS method is called “least squares” because it minimizes the sum of the squared vertical distances between the observed data points and the regression line. The line obtained through OLS is the “best-fitting” line because it minimizes the total squared error between the observed values and the predicted values.
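These formulas translate almost line for line into NumPy. The sketch below (the function name is my own) computes the OLS estimates from arrays of X and Y values:

import numpy as np

def ols_estimates(X, Y):
    X_bar, Y_bar = X.mean(), Y.mean()
    # slope: sum of cross-deviations of X and Y over the sum of squared deviations of X
    beta1 = np.sum((X - X_bar) * (Y - Y_bar)) / np.sum((X - X_bar) ** 2)
    # intercept: chosen so the line passes through the point (X̄, Ȳ)
    beta0 = Y_bar - beta1 * X_bar
    return beta0, beta1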
Once you have estimated the values of β0 and β1 using OLS, you can use the linear equation (Y = β0 + β1 * X) to predict the value of the dependent variable Y for any given value of the independent variable X. Additionally, you can assess the goodness of fit of the regression model and make inferences about the relationship between the two variables using statistical tests and measures such as R-squared, t-tests, etc.
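For instance, a minimal sketch of prediction and an R-squared check (using the standard 1 − SSR/SST definition; the function names are mine) might look like this:

import numpy as np

def predict(beta0, beta1, X):
    return beta0 + beta1 * X                   # fitted line Y = β0 + β1 * X

def r_squared(Y, Y_hat):
    ss_res = np.sum((Y - Y_hat) ** 2)          # variation left unexplained by the line
    ss_tot = np.sum((Y - Y.mean()) ** 2)       # total variation of Y around its mean
    return 1 - ss_res / ss_tot                 # share of variation explained by the model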
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Load the data
data = pd.read_csv("1.01. Simple linear regression.csv")
# Define the dependent and the independent variables
y = data['GPA']
x1 = data['SAT']
# Explore the data
plt.scatter(x1, y)
plt.xlabel('SAT', fontsize=20)
plt.ylabel('GPA', fontsize=20)
plt.show()
# Run the regression: add a constant column so the model includes an intercept term
x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
print(results.summary())
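# The fitted coefficients can also be read programmatically instead of being
# copied from the summary table; results.params holds the intercept ('const')
# and the slope ('SAT'), roughly 0.275 and 0.0017 for this dataset
print(results.params)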
# Plot the data with the fitted line ŷ = b₀ + b₁x₁, using the intercept and
# slope reported in the regression summary
plt.scatter(x1, y)
yhat = 0.0017 * x1 + 0.275
plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
plt.legend()
plt.xlabel('SAT', fontsize=20)
plt.ylabel('GPA', fontsize=20)
plt.show()
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate some example data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")
# Plot the training data and the regression line
plt.scatter(X_train, y_train, label='Training Data')
plt.scatter(X_test, y_test, label='Test Data')
plt.plot(X_test, y_pred, color='red', linewidth=3, label='Regression Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()